Audio-visual Speech Processing
نویسنده
چکیده
Speech is inherently bimodal, relying on cues from the acoustic and visual speech modalities for perception. The McGurk effect demonstrates that when humans are presented with conflicting acoustic and visual stimuli, the perceived sound may not exist in either modality. This effect has formed the basis for modelling the complementary nature of acoustic and visual speech by encapsulating them into the relatively new research field of audio-visual speech processing (AVSP). Traditional acoustic based speech processing systems have attained a high level of performance in recent years, but the performance of these systems is heavily dependent on a match between training and testing conditions. In the presence of mismatched conditions (eg. acoustic noise) the performance of acoustic speech processing applications can degrade markedly. AVSP aims to increase the robustness and performance of conventional speech processing applications through the integration of the acoustic and visual modalities of speech, in particular the tasks of isolated word speech and text-dependent speaker recognition. Two major problems in AVSP are addressed in this thesis, the first of which concerns the extraction of pertinent visual features for effective speech reading and visual speaker recognition. Appropriate representations of the mouth are explored for improved classification performance for speech and speaker recognition. Secondly, there is the question of how to effectively integrate the acoustic and visual speech modalities for robust and improved performance. This question is explored in-depth using hidden Markov model (HMM) classifiers. The developii ment and investigation of integration strategies for AVSP required research into a new branch of pattern recognition known as classifier combination theory. In this thesis a novel framework is presented for optimally combining classifiers so their combined performance is greater than any of those classifiers individually. The benefits of this framework are not restricted to AVSP, as they can be applied to any task where there is a need for combining independent classifiers.
منابع مشابه
An Audio-visual Speech Recognition System for Testing New Audio-visual Databases
For past several decades, visual speech signal processing has been an attractive research topic for overcoming certain audio-only recognition problems. In recent years, there have been many automatic speech-reading systems proposed that combine audio and visual speech features. For all such systems, the objective of these audio-visual speech recognizers is to improve recognition accuracy, parti...
متن کاملEvent-Related Potentials Associated with Somatosensory Effect in Audio-Visual Speech Perception
Speech perception often involves multisensory processing. Although previous studies have demonstrated visual [1, 2] and somatosensory interactions [3, 4] with auditory processing, it is not clear whether somatosensory information can contribute to the processing of audio-visual speech perception. This study explored the neural consequence of somatosensory interactions in audio-visual speech pro...
متن کاملCipher text only attack on speech time scrambling systems using correction of audio spectrogram
Recently permutation multimedia ciphers were broken in a chosen-plaintext scenario. That attack models a very resourceful adversary which may not always be the case. To show insecurity of these ciphers, we present a cipher-text only attack on speech permutation ciphers. We show inherent redundancies of speech can pave the path for a successful cipher-text only attack. To that end, regularities ...
متن کاملEffective visually-derived Wiener filtering for audio-visual speech processing
This work presents a novel approach to speech enhancement by exploiting the bimodality of speech and the correlation that exists between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains clean speech statistics from visual features by modelling their joint density and making a maximum a posteriori estimate of clean audio from v...
متن کاملMeasuring Audio and Visual Speech Synchrony: Methods and Applications
Speech is a means of communication that is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent works in the field of audiovisual speech and more specifically on techniques developed to measure the level of correspondence between audio and visual speech. It overviews the most common audio and visual speech front-end processing, tran...
متن کاملMeasuring Audio and Visual Speech Synchrony: Methods
Speech is a means of communication that is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent works in the field of audiovisual speech and more specifically on techniques developed to measure the level of correspondence between audio and visual speech. It overviews the most common audio and visual speech front-end processing, tran...
متن کامل